Feature selection with a genetic algorithm can help improve the distinguishing power of microbiota information in monozygotic twins' identification

Introduction Personal identification of monozygotic twins (MZT) has been challenging in forensic genetics. Previous research has demonstrated that microbial markers have potential value due to their specificity and long-term stability. However, those studies would use the complete information of detected microbial communities, and low-value species would limit the performance of previous models. Methods To address this issue, we collected 80 saliva samples from 10 pairs of MZTs at four different time points and used 16s rRNA V3–V4 region sequencing to obtain microbiota information. The data formed 280 inner-individual (Self) or MZT sample pairs, divided into four groups based on the individual relationship and time interval, and then randomly divided into training and testing sets with an 8:2 ratio. We built 12 identification models based on the time interval ( ≤ 1 year or ≥ 2 months), data basis (Amplicon sequence variants, ASVs or Operational taxonomic unit, OTUs), and distance parameter selection (Jaccard distance, Bray-Curist distance, or Hellinger distance) and then improved their identification power through genetic algorithm processes. The best combination of databases with distance parameters was selected as the final model for the two types of time intervals. Bayes theory was introduced to provide a numerical indicator of the evidence's effectiveness in practical cases. Results From the 80 saliva samples, 369 OTUs and 1130 ASVs were detected. After the feature selection process, ASV-Jaccard distance models were selected as the final models for the two types of time intervals. For short interval samples, the final model can completely distinguish MZT pairs from Self ones in both training and test sets. Discussion Our findings support the microbiota solution to the challenging MZT identification problem and highlight the importance of feature selection in improving model performance.

Introduction: Personal identification of monozygotic twins (MZT) has been challenging in forensic genetics. Previous research has demonstrated that microbial markers have potential value due to their specificity and long-term stability. However, those studies would use the complete information of detected microbial communities, and low-value species would limit the performance of previous models.
Methods: To address this issue, we collected saliva samples from pairs of MZTs at four di erent time points and used s rRNA V -V region sequencing to obtain microbiota information. The data formed inner-individual (Self) or MZT sample pairs, divided into four groups based on the individual relationship and time interval, and then randomly divided into training and testing sets with an : ratio. We built identification models based on the time interval (≤ year or ≥ months), data basis (Amplicon sequence variants, ASVs or Operational taxonomic unit, OTUs), and distance parameter selection (Jaccard distance, Bray-Curist distance, or Hellinger distance) and then improved their identification power through genetic algorithm processes. The best combination of databases with distance parameters was selected as the final model for the two types of time intervals. Bayes theory was introduced to provide a numerical indicator of the evidence's e ectiveness in practical cases.

. Introduction
Distinguishing monozygotic twins (MZT) has long been a challenging task in individual identification, and it has significant implications in some cases. For example, Jobling (2013) introduced several of such cases, including a rape case where the characterization was obstructed by the existence of the suspect's MZT brother and the difficulty of distinguishing them. Various techniques have been established to identify MZTs. One such technique involves the analysis of Single Nucleotide Polymorphism (SNP), Count Number Variation (CNV), and Insertion/Deletion variation (InDel) markers (Hannelius et al., 2007;Abdellaoui et al., 2015;Xu et al., 2017). Researchers have reported that there may be differences in these genomic biomarkers between MZTs. However, due to the nearly identical nuclear and mitochondrial genome information shared by MZTs, such differences are infrequent (McRae et al., 2015;Ming et al., 2019) and genome-wide sequencing is typically necessary to differentiate MZTs. Another MZT identification strategy involves the investigation of various biomarkers from the perspective of epigenetics, including DNA methylation (Stewart et al., 2015;Xu et al., 2015) and microRNA (Fang et al., 2019;Xiao et al., 2019). Applying such biomarkers, the difference between MZTs would be more significant and the identification efficacy can be promising (Xu et al., 2015). However, using such biomarkers would necessitate strict material standards which would limit their application in daily work. Specially, RNA is easily degraded by exonuclease hydrolysis in the environment and DNA methylation method requires high quality and quantity of DNA because sodium bisulfite may lead to DNA fragmentation and loss. Therefore, more suitable biomarkers for MZT identification are still required.
Advances in microbiology may provide another possible solution to the MZT identification problem. As it is known, human bodies contain two genomes: the human genome is inherited from parents, and the microbiota reside on/in the human body. The latter, known as the "second genome" of the human body, is thought to contain genetic information that is ∼50-100 times greater than the human genome (Grice and Segre, 2012). Recent advances in sequencing technology have helped the growth of forensic microbiology, an interdisciplinary field of forensic medicine and microbiology (Ventura Spagnolo et al., 2019;Speruda et al., 2022). With a sharp focus much attention has been focused on the application of microbial information, with a particular emphasis on postmortem interval estimation (Liu et al., 2022), cause of death inference (Kaszubinski et al., 2020), body fluid type identification (Yao et al., 2021), and individual identification (Watanabe et al., 2018).
Numerous studies have revealed that everyone has a unique microbiome and differs, including between MZTs (Stahringer et al., 2012;Abeles et al., 2014;Suzuki et al., 2022;Sukumar et al., 2023). For example, Stahringer et al. (2012) demonstrated that the difference in salivary microbiome 16s rRNA data between MZT individuals was not less than that between dizygotic twins (DZT) and that this difference gradually increased after they lived apart. Based on such differences, primary studies on the MZT identification model constructions have been conducted. For example, Bozza et al. (2022) collected saliva samples from individuals with different relationships (unrelated individuals, MZT, or inner-individual samples) for sequencing and then built a model that can roughly distinguish between MZT and innerindividual samples. However, such studies would include all tested species in the model construction. In theory, the sensitivity of different microbial species to environmental changes should differ, resulting in different stability within the same individual. Therefore, changes in sequencing data of relatively unstable species would partially offset support for the same identification from relatively stable species. Therefore, there is still room for improvement in this field.
In this study, 80 saliva samples were collected from 10 MZT pairs to collect microbiota information via 16s rRNA sequencing. Twelve models were developed, each considering sample collection intervals, sequence data basis, and distance selection. High-value species were selected using Genetic Algorithm (GA) processes in such models. Finally, the two models with the best distinguishing power in the training test (under two different sampling intervals) were selected. The model developed for the short interval samples could completely separate MZT from inner-individual sample pairs in training and test sets.
. Materials and methods

. . Sample collection
The present study included 10 pairs of MZT (four male and six female pairs). Participants signed informed consent forms before data collection, and the research content met the medical ethics requirements of Hebei Medical University. For each participant, four freely flowing saliva samples (1 mL each) were collected at four different time points (TP 1-4), the second, third, and fourth of which were collected 12, 13, and 14 months after the first one. Each participant stated that they had not used antibiotics within 6 months before sampling and had not consumed food or water within 2 h of sampling. A total of 80 samples were collected and labeled based on the participant and the time point. For example, "S10A1" denoted the sample belonging to individual A of the 10th MZT pair and was collected at the first time point. Genomic DNA was extracted from the saliva samples using the TGuide S96 Magnetic Soil/Stool DNA Kit (TIANGEN Biotech, Beijing) as recommended by the manufacturer and stored at −80 • C until needed.

. . Library preparation and sequencing
A two-round-PCR workflow was used for the library preparation for the 16s rRNA V3-V4 sequencing that involves four steps: (i) First round PCR: The following primers were used to amplify the target regions of 16S rDNA: 338F: 5 ′ -ACTCCTACGGGAGGCAGCA-3 ′ and 806R: 5 ′ -GGACTACHVGGGTWTCTAAT-3 ′ . PCR was performed in 10 µL reactions, with 0.3 µL forward primers (10 µM), 0.3 µL reverse ones (10 µM), 50 ng ± 20% template DNA, 0.2 µL KOD FX Neo, 5 µL KOD FX Neo Buffer, and 2 µL dNTP (2 mM of each The first round PCR product was used as a template in the second round PCR to add index labels to each sample. PCR was performed in 20 µL reactions containing 5 µL of the first PCR round product, 2.5 µL MPPI-a (2 µM), 2.5 µL MPPI-b (2 µM), and 10 µL 2 × Q5 HF MM. The PCR program consisted of 98 • C for 30 s, followed by 10 cycles of 98 • C for 10 s, 65 • C for 30 s, and 72 • C for 30 s, with a final extension at 72 • C for 5 min. (iii) Quantifying and sample mixing: The integrity of the second round PCR product for each sample was tested using 1.8% agarose gel electrophoresis (120 V, 40 min), and then quantitative analysis was performed based on electrophoresis findings using ImageJ software. The samples were mixed by equal mass based on the quantification results. (iv) Purifying and recovering: The mixed samples were purified using OMEGA DNA purification columns. The fragments with target lengths were then cut and recovered in a second round of 1.8% agarose gel electrophoresis (120 V, 40 min). Then, sequencing was performed on the Illumina Novaseq 6000 sequencing platform using a double-end sequencing strategy (PE250, Biomarker Technologies Co, Ltd., Beijing, China) according to standard protocols.

. . Preliminary analysis of microbial diversity
Several steps were taken on the original sequence data to obtain high-quality reads: (i) FLASH (Magoč and Salzberg, 2011) (version 1.2.11) was used to splice the original sequencing data and obtain Raw Tags, with a minimum overlap length of 10 bp and a maximum mismatch ratio set as 0.2; (ii) Trimmomatic (Bolger et al., 2014) (version 0.33) was used to filter the spliced sequence and obtain the Clean Tags: Raw Tags were checked sequentially using a 50 bp window, once the average quality value within the window was <20, the subsequent bases would be truncated. Truncated Tags with <75% length would be filtered; (iii) UCHIME (Edgar et al., 2011) (version 8.1) was used to obtain Clean Reads by removing chimeras. The threshold for confirming a chimera was set at 80% similarity between the query sequence and parent sequences selected from the reference database.
Two strategies were employed to achieve two types of microbial information (i.e., the "feature" to be selected) based on these filtered high-quality sequences: (i) The UPARSE algorithm was used on USEARCH software version 10.0 (Edgar, 2013) to cluster the sequences at a level of 97% similarity to obtain Operational Taxonomic Unit (OTU); and (ii) the dada2 workflow (Callahan et al., 2016) on QIIME2 platform (Bolyen et al., 2019(Bolyen et al., ) (2020) was used to perform noise reduction and to obtain Amplicon Sequence Variant (ASV, i.e., OTU with a level of 100% similarity). The OTUs/ASVs with <0.005% of total reads were filtered out (Bokulich et al., 2013). The SILVA database Release 138.1 (Quast et al., 2012) was used as a reference for species annotation based on ASV information, using a naive Bayesian classifier combined with alignment. RDP Classifier version 2.2 (Wang et al., 2007) was used as the annotating tool. The abundance information of various species at different taxonomic levels was then summarized using QIIME2.
The Chao 1, ACE, Shannon-wiener, and Simpson indices were calculated using Mothur v.1.30 (Schloss et al., 2009) for alpha diversity analysis for each sample. In addition, principal coordinate analysis (PCoA) was performed using the distance matrix of beta diversity parameters (such as Bray-Curtis distance).
. . Feature selection to promote the distinguishing ability of microbial model in MZT identification . . . Division of data sets There would be eight samples for each MZT pair, resulting in C 2 8 = 28 sample pairs. All the 280 pairs (28 sample pairs per MZT pair × 10 MZT pairs) were divided into four groups based on two dimensions: (i) the relationship between the individuals from which the two samples were collected (samples from the same individual were labeled as "Self, " and samples from different individuals within an MZT pair were labeled as "MZT"); and (ii) the sample collection interval ( 2 months was labeled as "short", or 12 months was labeled as "long"). The sample pairs in the four groups were randomly divided into training or test sets with an 8:2 ratio, where the test set is used only for the validation of the final models. The sample amounts in each group are listed in Table 1. . . . Feature type and beta distance parameter choice As stated in Section 2.3, two types of features (i.e., microbial sequencing information) would be used as the data basis for model construction: OTUs or ASVs. Three types of beta diversity distances were calculated based on these two types of features: (i) Jaccard distance (JD), which only considers whether a specific feature was detected in two samples. JD between two samples, A and B, is denoted as JD (A, B) and calculated as shown in Equation (1), if the symbol "|A∩B|" denoted the number of features detected in both samples A and B, and "|A∪B|" represented the total number of features detected in the two samples.
(ii) Bray-Curtis Distance (BC), which represents the difference in absolute abundance information between two samples of the  (2), if symbols "a i " and "b i " denote the absolute abundance of the ith feature in samples A and B, respectively.
(2) (iii) Hellinger distance (HD), which calculates the difference in relative abundance information between two samples of the tested features, calculated as Equation (3), if " j a j " and " j b j " denote the total abundance of all features considered in samples A and B, respectively.
In summary, six possible models could be constructed based on data basis and distance choices for distinguishing the Self and MZT pairs under each sampling interval. Therefore, a total of 12 models will be constructed. The distinguishing ability of the 12 choices would be measured based on all features detected. The area under the receiver operating characteristic curve (AUC) would be used in such measurements. The AUC parameter serves as a commonly used metric for evaluating binary classifiers. For each of the 12 models, the following steps can be taken to calculate the AUC: Firstly, calculation results for all sample pairs are sorted from smallest to largest and assigned rank numbers, during which the average rank would be assigned for pairs with identical results. Secondly, if the number of MZT pairs and Self pairs are indicated as n M and n S , respectively, and the sum of ranks in MZT pairs is represented with symbol "R M , " the AUC can be obtained by using the following equation: .

. . Feature selection based on genetic algorithm
To construct more efficient discrimination models, GAs were performed 12 times on MATLAB (R2021a update 4, MathWorks, United States) to screen the high-value OTUs or ASVs in the identification models. Since their introduction in the 1970s, GAs have been widely used in feature selection. The algorithm uses mathematical methods and computer simulation operations to transform the process of solving complex problems into processes similar to the chromosome genes crossover and mutation in biological evolution. Then it searches for the optimal solution by simulating natural evolution processes.
The operational process of GA involves several fundamental concepts that are presented in bold and italicized words: (i) An individual presents a potential solution to the problem. Each individual is a digital string of the same length as the number of detected features in all samples. Each position on the string is called a gene, one-to-one corresponding to the detected features. Each gene has two possible states called alleles, 1 if the corresponding feature is selected or 0 if not. (ii) A population is a group of individuals. Besides the initial population [P(0)], populations would be generated iteratively and denoted as P(x), where "x" denote the iteration generation and x = 10,000 is set as the first termination condition. (iii) The value of an individual in MZT identification would be measured with the "fitness function." In the present study, the corresponding distance parameter would be calculated for each pair in the training set with the corresponding interval based on the features being selected according to a specific individual (i.e., the corresponding allele being 1); then AUC would be calculated with Equation (4) and used as the fitness function. The second termination condition would be AUC = 1, indicating that the two groups can be completely separated.
The GA operation would be carried out in the following steps for each of the 12 model construction processes: (i) Initialization: Randomly generate 2,000 individuals to form P(0). Calculate AUC for each individual; if the second termination condition is fulfilled, output the optimal solution directly; otherwise, enter the evolution process.
(ii) Evolution: Three steps would be taken to generate a new population, P(x+1), from the existing one, P(x): (a) Selection: Individuals with the lowest 10% AUC in P(x) would be eliminated. According to their AUCs, 2,000 "fathers" and 2,000 "mothers" would be selected from the remaining 90% P(x) individuals according to their AUC: (b) Crossover: Individuals in P(x+1) would inherit alleles from their "parents, " alleles from both of which have the same probability of being inherited for each gene.
(c) Mutation: A part of the alleles in the next generation would be reversed between 0 and 1. In the present study, the mutation rate, i.e., the probability of such reversion occurring, was 1%.
(iii) Evaluation and iteration: Calculate AUC for each individual of P(x+1). If any of the two termination conditions are fulfilled, output the optimal solution and terminate the process; otherwise, repeat the evolution process in step (ii).
After the 12 GA processes, select the two feature-distance combinations with the best AUCs for each type of interval to build the MZT identification model.

. . Introducing Bayes theory in the final models
After feature selection, the specific distance between a sample pair can be used to infer the relationship between the sourcing individuals. However, in some cases, it may be necessary to answer the question, "How confident are we in judging the relationship between these two individuals." The likelihood ratio calculation could be helpful in answering the question.

. . . Basic theories
When only one individual of the MZT pairs is available as the suspect for MZT identification, there are more than two exclusive assumptions (Bozza et al., 2022), such as H 1 : the query saliva trace originates from the suspect; or H 2 : the query saliva trace originates from the suspect's MZT sibling. The present study aimed to estimate the probability of H 1 being true, considering the distance obtained by sequencing and calculating, i.e., Pr(H 1 |D = d) where "d" stands for the specific distance value and "D = d" denote the event that d is calculated between the two samples. According to the Bayes Rules, such a probability can be calculated by Equation (6): Where Pr(H 1 ) denote the probability of H 1 being true without considering the microbial evidence, and Pr(D = d) denote the probability of two individuals exhibiting corresponding distance. Pr(D = d) is usually difficult to calculate accurately because there may be infinite hypotheses, resulting in a high difficulty in accurately calculating Pr(H 1 |D = d). However, calculating the ratio of it to the probability of another hypothesis being true would be much easier, such as: Where the ratio Pr(H 1 )/Pr(H 2 ) is known as the ratio of prior probabilities, and if the probability of any hypothesis being true is assumed to be equal, such a ratio should equal 1. Thus, if we define a parameter likelihood ratio (LR or BF in some studies; Bozza et al., 2022) as Equation (8), it can represent the degree to which H 1 is more likely to be true than H 2 . When considering a determined value, the probability ratio should equal the ratio between the probability density in the Self group and that in the MZT group, labeled withf : .

. . Model construction based on LR results
To obtain the two probabilities in Equation (8), the distribution of the specific distance between Self or MZT sample pairs must be estimated while considering the time interval. Gaussian Kernel density estimation (KDE) was used as the estimation method. Assume that n pairs of specific samples are tested, and the specific distance between each pair is labeled as X 1 , X 2 , ..., and X n , then the probability density can be estimated by Equation (9): And the bandwidth h was set as Equation (10), where δ denotes the standard deviation of all X i .
Thus, the LR of each pair could be calculated by Equation (11) Where n M or n S denotes the number of MZT or Self pairs within a specific time interval, and h M or h S represents the bandwidth of corresponding groups. S i denotes the specific distance between the ith Self pair and M j the distance between the jth MZT pair.
After calculating LR for each of the 224 sample pairs, diagnostic tests were performed between MZT and Self pairs at each time interval. Two types of thresholds would be used in the final models: (i) T max : the LR thresholds that may provide the best Youden index [YI, which can be calculated with Equation (12) 12) . . Validation of the model Based on the two feature-distance combinations selected, LRs would be calculated for each of the 56 sample pairs in the test set, and diagnostic tests would be performed. The validation parameter of the final models would be YI under the two types of threshold sets.

. . Sequencing results
Supplementary Table 1 contains basic information about the 10 MZT pairs: age, gender, and population. After sequencing, 6,134,347 pairs of reads were obtained, with 6,107,044 clean reads remaining after splicing and filtering, as described in section 2.3. At least 48,389 clean reads were obtained for each of the 80 samples, with the mean number of clean reads per sample being 76,338. After clustering at 97% similarity and filtering with a threshold of 0.005% in total reads, 369 types of OTUs were detected across all 80 samples. The number of OTUs obtained per sample ranged from 141 to 332, averaging 230. As shown in Figure 1A, OTUs obtained in samples collected at the first time point for each participant would differ from the other three. Those differed significantly between inner-individual samples could be "burdens" on the identification model, implying the importance of feature selection. A total of 1130 ASVs were obtained after noise reduction using DADA2 and filtering with a 0.005% threshold in the 80  samples. Figure 1B indicates that the number of ASVs obtained per sample ranged from 96 to 260, with an average of 181.
Taxonomic annotation was performed using the SILVA 138 database, as mentioned in Section 2.3. There were 13 phyla detected in all 80 samples, with the top five (Firmicutes, Proteobacteria, Bacteroidota, Actinobacteriota, and Fusobacteriota) accounting for 97%. At the genus level, 131 genera were detected, with Streptococcus and Neisseria accounting for about 21 and 13% of the bacteria in saliva, respectively. The top five genera account for more than half of all species. Figure 2 depicts the taxonomic composition of all samples at the phylum and genus levels, and Supplementary Figure 1 indicates the compositions in each sample. Differences in taxonomic compositions can be found between both Self sample pairs and MZT sample pairs, indicating that microbiota information may have the potential to distinguish MZT. However, feature selection would be required in the model construction.
Basic information of OTUs and ASVs applied in the model construction is presented in Supplementary Table 2, including taxonomic annotation, Reads in each sample and the status of selection in the final model.

. . Preliminary analysis . . . Alpha diversity analysis
Based on ASV information, four alpha diversity indices were calculated for each sample. After calculation, it was found that both ACE and Chao 1 indices of any sample equaled the number of ASV detected from the corresponding sample. For each individual, the four samples collected at different time points can form six comparisons; and for each type of comparison, paired t-test or Wilcoxon sign rank-sum test was performed to assess the difference between the distributions of the alpha indices between the corresponding two time points, according to whether the differences between all samples fit the normal distribution. As shown in Figure 3 and Table 2, if the time interval was <2 months within the same individuals, no significant difference could be found regardless of which index was calculated. More significant differences could be found between samples collected at the first time point and the other three if Shannon and Simpson's indices were more calculated than the other two.

. . . PCoA based on beta diversity analysis
The three types of beta diversity parameters mentioned in Section 2.4.2 were used to perform principal coordinate analysis as shown in Figure 4, indicating that the results were similar to each other. The samples collected at the first time point were significantly different from other samples, i.e., the difference between the collection times point would overcome the difference between MZT pairs.

. . Feature selection with GA processes
As mentioned in Section 2.4.1, the 80 samples could form 280 Self or MZT sample pairs and be divided into four groups based on the relationship of the sourcing individuals and the time interval. Each group was randomly divided into training and test sets in an 8:2 ratio. A total of 12 GA processes were performed based on different feature-distance combinations in the training set, with the best AUC of each population generation listed in Supplementary Figure 2. Except for two GAs that stopped due to satisfying AUC = 1, the generational-best AUC of the remaining 10 GAs fluctuated within a relatively flat range after 1,000-1,500 iterations. The final output result can be considered  close to the optimal solutions. Table 3 indicates the best AUCs for each GA, each of which is better than the corresponding AUC if all species detected were considered. Supplementary Figure 3 depicts the distance distributions based on the corresponding selected species and the KDE results. If the time interval is short and JD is used, the two groups could be completely separated, regardless of the choice in OTUs or ASVs. However, the KDE results indicate that if the final chosen combination of ASVs is used, the gap between the two curves would be larger, implying a better distinguishing ability. Meanwhile, when the time interval is long, the ASV-JD combination can also provide the best AUC. Therefore, ASV-JD combinations are selected for further model construction. As mentioned in Section 3.1, the status of whether each ASV is applied in the final model is annotated in Supplementary Table 2B.

. . Construction and validation of the final models
In the present study, the training set comprised 224 sample pairs. JD was calculated for each sample pair based on one of the two ASV combinations, as selected in Section 3.3, according to the time interval between the samples. Further, probability density curves were obtained for the four groups by processing the calculation results using KDE. These groups were segregated based on the relationship and time interval between the samples. Then, LR was computed for each of the 224 sample pairs and the corresponding results are shown in Table 4 (the first 4 rows) and Figures 5A, B. In the training set, the minimum LR of the Short-Self group was 0.8152, which was slightly higher than the maximum LR .
/fmicb. . of Short-MZT group (0.3638). Therefore, if the confirming Self pair threshold is set between the two LR values (e.g., 0.4), the two groups in the training set can be completely separated, i.e., YI = 1. If LR = 1 is set as the threshold, one in 48 Short-Self pairs will be identified as an MZT pair, while no MZT pair will be mistakenly identified, i.e., YI = 0.9792. Meanwhile, in the training set, the minimum LR of the Long-Self group was 0.9553, and 3 Long-MZT pairs had LR higher than that. The best YI was achieved when the threshold was set between 1.2214 and 1.2548 when the Long-Self pair with the minimum LR would be the only pair incorrectly identified, and YI = 0.9792. If LR = 1 is set as the threshold, three Long-MZT pairs and one Long-Self pair would be mistakenly identified, resulting in YI = 0.9167. Furthermore, JD was calculated similarly for each sample pair in the test set, and then LR was calculated based on the above mentioned probability density curves. The two types of threshold sets were validated in the test set, as shown in Table 4 (the last four rows) and Figures 5C, D (the two lower sub-figures). Comparing to the training set, the model under short intervals could provide a similar YI in the corresponding test set. In contrast, two types of YI would drop sharply for identification between Long-Self and Long-MZT groups, from ∼0.9 in training set to 0.5 (with both sensitivity and specificity being 0.75) in test set.

. Discussion
In the present study, MZT identification models were built using microbiota 16s rRNA V3-V4 region information from 80 saliva samples collected from 10 pairs of MZT individuals while considering the time interval of sample collections. GAs were used to select high-value species-distance combinations for such models. As a result, JD calculated from 516 selected ASVs was used to construct a model that could completely separate MZT sample pairs collected in a short time interval ( 2 months, i.e., Short-MZT pairs) from Short-Self pairs in both the training and test sets. The distinguishing power of such a model was improved when compared to all 1,130 detected types of ASVs with the same distance parameter. A similar improvement in the GA process could be found for sample pairs with long time intervals ( 1 year), and .
/fmicb. . 510 ASVs were selected to construct a model that would make correct decisions for 95 out of 96 pairs of the training set. However, the accuracy of such a model would drop sharply in the test set, where 6 of 24 pairs were misjudged, indicating the probability of over-fitting. It is imperative that applied biomarkers exhibit individual specificity (i.e., distinction between different individual) to distinguish between MZT pairs effectively. Microbiota information is a more promising avenue for this purpose than the traditional genomic biomarkers, such as short tandem repeats (STRs) or single nucleotide polymorphisms (SNPs), which theoretically should be identical between MZTs. The first phase of the Human Microbiome Project (HMP) demonstrated that each collected sample contained a unique set of microorganisms (Human Microbiome Project Consortium, 2012), and multiple studies have highlighted the occurrence of similar differences between MZTs (Bowyer et al., 2022;Sukumar et al., 2023). The composition of an individual's microbiota undergoes a succession process throughout their lifetime (Martino et al., 2022), which can be influenced by several factors such as dietary habits, antibiotic treatments (Bokulich et al., 2016), diseases, or stochastic factors (Turnbaugh et al., 2010). While these factors shape the individual specificity of microbiota communities, they also continuously affect the stability of the communities. And another important requirement of the MZT distinguishing biomarkers is inner-individual stability, which ensures that the difference between the corresponding biomarkers from inter-individual samples is highly likely to exceed the difference between inner-individual samples. The sensitivity of different microorganisms to the above-mentioned factors should differ; the more sensitive species would significantly affect the inner-individual stability and thus "drag down" the application value of the MZT identification models. This could partially explain the improved performance of the model after the feature selection process, as represented by AUC (as illustrated in Table 3).
Based on ASV data, four types of alpha diversity parameters were calculated for each of the 80 samples. ACE and Chao 1 estimators measure species richness among these parameters, while Shannon and Simpson's indices consider community evenness. Analysis of Figure 3 reveals no significant differences in innerindividual samples collected at TP 2-4. Meanwhile, significant differences were identified between these samples and those collected at TP 1 when Shannon and Simpson's indices were used. This suggests that microbiota composition can be stable over a relatively short period of time. In contrast, calculating the ACE or Chao 1 estimator only revealed a significant difference between samples with the longest time intervals (14 months between TP 1 and TP 4). This indicates that the inner-individual stability of microbial species richness is superior to the stability considering community composition. This finding may help to explain why subsequent model constructions based on JD (which also considers only species richness) performed better under the same conditions than models based on the other two types of distances (which consider the community composition).
The calculated ACE and Chao 1 indices for each sample equaled the detected ASV of the corresponding sample. The reason would be that after being confirmed as ASVs, sequences were filtered with a 0.005% threshold, making the minimum reads of the final used ASVs > 48,389 × 0.005% = 2.41945. Consequently, parameters "F1" and "F2" used in the calculation of ACE and Chao 1 indices were set to zero for each sample, and the two indices equaled the observed ASV number, as shown in Equation (13).
Chao 1 = S obs + F1(F1−1) 2(F2+1) = S obs (13) Where S obs , S abund , and S rare denote the total number of observed ASVs, the number of ASVs with large reads, and the number of ASVs with small reads; F1 and F2 represent the number of ASVs with reads 1 and 2, respectively. C ACE and γ 2 ACE can be calculated with several additional steps, but their values do not affect the final result when F1 = 0.
Based on the data presented in Supplementary Figure 3, it appears that distances exhibited a unimodal distribution across all 24 groups, accounting for three types of distances, two types of time intervals, two types of databases, and two types of sourcing individuals' relationships. Therefore, the conversion of LRs based on the aforementioned distributions results in a rough negative correlation with distance results. Consequently, there was no significant change in the numerical ranking of calculation results between samples, indicating that adding LR did not improve the distinguishing capability of the specific model in this study. Nevertheless, LR can be an effective tool for evaluating evidence .

FIGURE
Results of principal coordinate analysis (PCoA). PCoA was performed using three types of beta density parameters: (i) Dots: Jaccard distance; (ii) Triangles: Bray-Curitis distance; (iii) Rhombus: Hellinger distance. Each point represents a sample, which is colored according to the time point of sample collection. Dotted lines connect samples collected from an MZT pair at the same time. credibility with respect to specific distance calculations. It enables a numerical representation of our confidence level in making a particular decision when faced with definite distance outcomes in real-world situations.
Limitations can be found in the present study. For example, unrelated sample pairs from different MZT pairs were not considered during the model construction process in this study. Fresh saliva samples were used, from which the host's DNA .
/fmicb. .  (12). * * Tmax: the LR thresholds could provide the best YI in the training sets. should be easily obtained. Therefore, individual identification issues other than MZT cases can be solved using traditional host-genome-level biomarkers like STRs or SNPs. If it is difficult to obtain the host's DNA, the sample is likely collected from another part of the body or is not fresh. The reference significance of results from fresh saliva samples to such cases would be reduced. This also highlights a problem that hinders using microorganisms in individual identification. The composition of microbial communities is greatly affected by the source body sites (Ward et al., 2018) and the time of isolation (Liu et al., 2022).
Another challenge to applying microbial 16s rRNA data in the forensic field is the lack of a standardized method of sequencing nomenclature, i.e., specific generic names for detected sequences, similar to the RefSNP (rs) system of SNP markers (Sherry et al., 2001). Because of this, the ASVs or OTUs in all samples were numbered sequentially, e.g., ASV1, and ASV2, which is obviously of a limited reference value. Such a problem would be covered if the model used all of the sequencing information, but it would be widened when feature selection is used. Before establishing the standardized method, researchers must construct their own microbial MZT identification models based on their sequencing . /fmicb. . data, instead of applying existing models, such as the ones we constructed in the present study. Moreover, the present study used a relatively small sample size, which limited the consideration of various factors that could affect the model. For example, recent studies have demonstrated that co-habitation may be the primary factor in bacterial sharing between MZT pairs due to person-to-person oral microbiome transmission between co-living family members (Valles-Colomer et al., 2023). Therefore, there would be an increase in microbiota similarity between MZT pairs with increased co-habitation time, and a decrease after they live apart (Stahringer et al., 2012;Valles-Colomer et al., 2023). The number of co-living and separated pairs among the 10 MZT pairs investigated in the present study were four and six, respectively, which are insufficient for statistical analysis when considering time intervals and dividing data sets. Similar insufficiency would occur when considering gender, age, diseases, and antibiotics applications. Therefore, large-scale research and experimentation considering co-living factors may provide a more accurate assessment of the potential of microbiota information in resolving the MZT identification problem.
In summary, the investigation of the microbial solution to the MZT identification problem is still in its early stages. And thus, rather than recommending a preconceived model, the primary objective of the current research is to underscore the potential utility of feature selection in facilitating the process of model construction when identifying MZT pairs with microbial information.

. Conclusions
The present study used 80 saliva samples from 10 pairs of MZT to develop models for identifying MZT based on their microbiota information. The feature selection method was used to build models that effectively distinguish MZT from Self pairs. When the sampling interval was <2 months, the final model could completely separate the two types of sample pairs. This highlights the potential value of microbiota information as a biomarker in MZT identification. Moreover, the feature selection process demonstrated its ability to refine models and filter out irrelevant data in this field, resulting in more accurate and reliable models.

Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: NCBI-PRJNA983501.

Ethics statement
The studies involving human participants were reviewed and approved by Medical Ethics Committee of Hebei Medical University. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.